SkipConvGAN: Monaural Speech Dereverberation using Generative Adversarial Networks via Complex Time-Frequency Masking
With advancements in deep learning approaches, the performance of speech enhancement systems in the presence of background noise has shown significant improvements. However, improving a system's robustness against reverberation
is still a work in progress, as reverberation tends to cause loss of formant
structure due to smearing effects in time and frequency. A wide range of deep
learning-based systems either enhance the magnitude response and reuse the distorted phase, or enhance the complex spectrogram using a complex time-frequency
mask. Though these approaches have demonstrated satisfactory performance, they
do not directly address the lost formant structure caused by reverberation. We
believe that retrieving the formant structure can help improve the efficiency
of existing systems. In this study, we propose SkipConvGAN - an extension of
our prior work SkipConvNet. The proposed system's generator network tries to
estimate an efficient complex time-frequency mask, while the discriminator
network aids in driving the generator to restore the lost formant structure. We
evaluate the performance of our proposed system on simulated and real
recordings of reverberant speech from the single-channel task of the REVERB
challenge corpus. The proposed system shows a consistent improvement across
multiple room configurations over other deep learning-based generative
adversarial frameworks.
Comment: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume: 30).
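As a rough illustration of the complex time-frequency masking described above, the sketch below applies a complex ratio mask to a reverberant STFT by complex multiplication. The tensor shapes and mask values are hypothetical placeholders for illustration, not the SkipConvGAN architecture itself.

```python
import torch

def apply_complex_mask(spec_real, spec_imag, mask_real, mask_imag):
    """Apply a complex ratio mask M to a reverberant STFT Y by complex
    multiplication: S_hat = M * Y, both in rectangular form."""
    enh_real = mask_real * spec_real - mask_imag * spec_imag
    enh_imag = mask_real * spec_imag + mask_imag * spec_real
    return enh_real, enh_imag

# Hypothetical shapes: (batch, freq_bins, time_frames).
y_real = torch.randn(1, 257, 100)
y_imag = torch.randn(1, 257, 100)
# In a system like SkipConvGAN the generator would predict the mask;
# here we use placeholder values purely for illustration.
m_real = torch.tanh(torch.randn(1, 257, 100))
m_imag = torch.tanh(torch.randn(1, 257, 100))
s_real, s_imag = apply_complex_mask(y_real, y_imag, m_real, m_imag)
```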
Complex-Valued Time-Frequency Self-Attention for Speech Dereverberation
Several speech processing systems have demonstrated considerable performance
improvements when deep complex neural networks (DCNN) are coupled with
self-attention (SA) networks. However, the majority of DCNN-based studies on
speech dereverberation that employ self-attention do not explicitly account for
the inter-dependencies between real and imaginary features when computing
attention. In this study, we propose a complex-valued T-F attention (TFA)
module that models spectral and temporal dependencies by computing
two-dimensional attention maps across time and frequency dimensions. We
validate the effectiveness of our proposed complex-valued TFA module with the
deep complex convolutional recurrent network (DCCRN) using the REVERB challenge
corpus. Experimental findings indicate that integrating our complex-TFA module
with DCCRN improves overall speech quality and performance of back-end speech
applications, such as automatic speech recognition, compared to earlier self-attention approaches.
Comment: Interspeech 2022: ISCA Best Student Paper Award Finalist.
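The following is a simplified sketch of two-dimensional time-frequency attention, with real and imaginary features stacked along the channel axis so that a single shared attention map depends on both parts. This is a real-arithmetic stand-in for the idea; the paper's actual complex-valued TFA formulation differs in detail.

```python
import torch
import torch.nn as nn

class SimpleTFA(nn.Module):
    """Simplified time-frequency attention: builds a 2-D (F x T)
    attention map from frequency-pooled and time-pooled statistics.
    Real and imaginary parts are stacked along channels, so their
    inter-dependencies influence the shared attention map."""
    def __init__(self, channels):
        super().__init__()
        self.time_att = nn.Conv1d(channels, channels, kernel_size=1)
        self.freq_att = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):          # x: (B, C, F, T)
        t_stat = x.mean(dim=2)     # (B, C, T): pooled over frequency
        f_stat = x.mean(dim=3)     # (B, C, F): pooled over time
        t_map = torch.sigmoid(self.time_att(t_stat))    # (B, C, T)
        f_map = torch.sigmoid(self.freq_att(f_stat))    # (B, C, F)
        att = f_map.unsqueeze(-1) * t_map.unsqueeze(2)  # (B, C, F, T)
        return x * att

# Real and imaginary features stacked along the channel axis.
feats = torch.randn(2, 16, 257, 100)
out = SimpleTFA(16)(feats)
```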
SpatialCodec: Neural Spatial Speech Coding
In this work, we address the challenge of encoding speech captured by a
microphone array using deep learning techniques with the aim of preserving and
accurately reconstructing crucial spatial cues embedded in multi-channel
recordings. We propose a neural spatial audio coding framework that achieves a high compression ratio by combining a single-channel neural sub-band codec with SpatialCodec. Our approach encompasses two phases: (i) a neural sub-band codec is designed to encode the reference channel at low bit rates, and (ii) SpatialCodec captures relative spatial information for accurate multi-channel reconstruction at the decoder end. In addition, we propose novel
evaluation metrics to assess the spatial cue preservation: (i) spatial
similarity, which calculates cosine similarity on a spatially intuitive
beamspace, and (ii) beamformed audio quality. Our system shows superior spatial performance compared with high-bitrate baselines and a black-box neural architecture. Demos are available at https://xzwy.github.io/SpatialCodecDemo. Code and models are available at https://github.com/XZWY/SpatialCodec.
Comment: Paper in Submission.
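To make the spatial similarity metric concrete, here is a minimal sketch that scores reference and estimated signals after both have been projected onto a common beamspace (one beamformed signal per look direction), averaging cosine similarity over directions. The beamformer design and direction count are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

def spatial_similarity(ref_beams, est_beams, eps=1e-8):
    """Cosine similarity between reference and estimated signals in a
    common beamspace, averaged over look directions.

    ref_beams, est_beams: (num_directions, num_samples) arrays holding
    one beamformed signal per look direction."""
    num = np.sum(ref_beams * est_beams, axis=1)
    den = (np.linalg.norm(ref_beams, axis=1)
           * np.linalg.norm(est_beams, axis=1) + eps)
    return float(np.mean(num / den))

# Hypothetical beamspace signals for 8 look directions, 1 s at 16 kHz.
rng = np.random.default_rng(0)
ref = rng.standard_normal((8, 16000))
est = ref + 0.1 * rng.standard_normal((8, 16000))
print(spatial_similarity(ref, est))
```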
Deep Neural Mel-Subband Beamformer for In-car Speech Separation
While current deep learning (DL)-based beamforming techniques have proven effective for speech separation, they are often designed to process narrow-band (NB) frequencies independently, which results in higher computational costs and inference times, making them unsuitable for real-world use. In this paper, we propose a DL-based mel-subband spatio-temporal beamformer
to perform speech separation in a car environment with reduced computation cost
and inference time. As opposed to conventional subband (SB) approaches, our
framework uses a mel-scale based subband selection strategy, which ensures fine-grained processing for lower frequencies, where most of the speech formant structure is present, and coarse-grained processing for higher frequencies. Robust frame-level beamforming weights are determined recursively for each speaker location/zone in the car from the estimated subband speech and noise covariance matrices. Furthermore, the proposed framework also estimates and suppresses echoes from the loudspeaker(s) using the echo reference
signals. We compare the performance of our proposed framework to several NB,
SB, and full-band (FB) processing techniques in terms of speech quality and
recognition metrics. Based on experimental evaluations on simulated and
real-world recordings, we find that the proposed framework achieves better separation performance than all SB and FB approaches, and approaches the performance of NB processing techniques while requiring a lower computational cost.
Comment: Submitted to ICASSP 2023.
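The mel-scale subband selection can be illustrated by grouping STFT bins with band edges spaced uniformly on the mel scale, which yields narrow subbands at low frequencies and wide ones at high frequencies. The band count and FFT size below are illustrative choices, not the paper's configuration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_subband_edges(num_bands, n_fft=512, sr=16000):
    """Group STFT bins into mel-spaced subbands: narrow groups at low
    frequencies (where formants live), wider groups at high
    frequencies. Each subband spans bins [edges[k], edges[k + 1])."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), num_bands + 1)
    hz_edges = mel_to_hz(mel_edges)
    return np.round(hz_edges / (sr / n_fft)).astype(int)

edges = mel_subband_edges(num_bands=24)
widths = np.diff(edges)  # small at low bins, large at high bins
print(edges, widths)
```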
Immersive Audio for Human-Machine Interface of Unmanned Ground Vehicles
An Immersive Audio Environment (IAE) system is designed for application to Unmanned Ground Vehicles (UGVs). The IAE system consists of a small-sized microphone array, an ADC, beamformers for the UGVs, Head Related Transfer Function (HRTF) filters, a DAC, and earphones for remote operators. The proposed IAE system is built by integrating commercial-off-the-shelf (COTS) products, with a sound synthesis system and a head/hand direction tracking system integrated to test its performance. The experimental results show that even with a small-sized Eigenmike®, the integrated IAE system works well in terms of the accuracy of sound direction detected by human operators. The microphone array, beamformers, and sound volume have very small effects on performance and accuracy, but the HRTFs have a strong effect on performance.
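As a minimal sketch of the HRTF-filtering stage in such a pipeline, the code below spatializes a single (e.g., beamformed) channel for headphone playback by convolving it with left/right head-related impulse responses. The HRIRs, sample rate, and decay shape here are synthetic placeholders rather than measured Eigenmike or HRTF data; a real system would look up measured HRIRs and update them with head tracking.

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Spatialize a single channel for headphone playback by convolving
    it with left/right head-related impulse responses (time-domain
    HRTFs) for the desired source direction."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=0)  # (2, samples)

rng = np.random.default_rng(1)
signal = rng.standard_normal(16000)  # 1 s at an assumed 16 kHz rate
# Synthetic exponentially decaying HRIR placeholders (128 taps each).
hrir_l = rng.standard_normal(128) * np.exp(-np.arange(128) / 32.0)
hrir_r = rng.standard_normal(128) * np.exp(-np.arange(128) / 32.0)
binaural = render_binaural(signal, hrir_l, hrir_r)
```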
An Immersive Audio Environment (IAE) system is designed for the application of Unmanned Ground Vehicles (UGV). The IAE system consists of a small sized microphone array, ADC, beamformers for the UGVs, Head Related Transfer Function (HRTF) filters, DAC, earphones for remote operators. The proposed IAE system is built by integrating commercial-off-the-shelf (COTS) products with a sound synthesis system and a head-hand direction tracking systems integrated to test the performance of the IAE system. The experiment results show that even with a small-sized Eigenmike R®, the integrated IAE system works well in terms of the accuracy of sound direction detected by human operators. The microphone array, beamformers, and sound volume have very small effects on the performance and accuracy, but the HRTFs have strong effect on the performance